Unsupervised Classification with Stochastic Complexity

Authors

  • JORMA RISSANEN
  • ERIC SVEN RISTAD
Abstract

1 Problem Statement

In unsupervised classification, we are given a collection of samples and must label them to show their class membership, without knowing anything about the underlying data-generating machinery, not even the number of classes. That is, we are given some sequence of observed objects, indexed 1, 2, and so on. We must then assign each object to one of a number of classes $C = C_1, \ldots, C_c$ in an "optimal" fashion. A solution to this problem, then, consists of (i) a measure of the quality of a given classification, and (ii) an algorithm for classifying a given set of objects.

Two popular definitions of optimality are predictive error and data reduction. "Predictive error" measures the ability of the classifier to correctly predict the likelihood and class membership of future samples. According to this criterion, the goal of classification is to efficiently discretize the continuous k-dimensional space of measurements in a manner that preserves the probability density of that space. A typical application is vector quantization of speech signals. "Data reduction" measures the classifier's ability to offer insight into large collections of high-dimensional data. According to this criterion, the goal of classification is to reduce an infeasibly large collection of data to a smaller, more feasible set in a manner that preserves the underlying structure of the original collection. A typical application is visualization of census data.

Unsupervised classification, sometimes called "clustering," is an essential tool in data analysis and pre-theoretical scientific inquiry. It has important applications in many fields of study, including ...

The problem of unsupervised classification presents us with two profound difficulties. The first difficulty is that the observations are not labeled with their classes, and therefore we do not know the correct number of underlying classes. If our classifier postulates too few or too many classes, then it distorts the density of the measurement space. Its ability to predict future data will suffer, as will its ability to offer insight into the global structure of the observations. The second difficulty inherent to unsupervised classification is one of computational complexity. In order to preserve the probability density of the measurement space, the number of classes and ...

Similar resources

Gaussian Processes for Big Data through Stochastic Variational Inference

Gaussian processes [GP 10] are perhaps the dominant approach for inference on functions. They underpin a range of algorithms for regression, classification and unsupervised learning. Unfortunately, exact inference in a GP has complexity O(n^3) with storage demands of O(n^2), and this hinders application of these models for 'big data'. Various approximate techniques have been suggested [see e.g. 1, 1...

Learning Genetic Population Structures Using Minimization of Stochastic Complexity

Considerable research efforts have been devoted to probabilistic modeling of genetic population structures within the past decade. In particular, a wide spectrum of Bayesian models have been proposed for unlinked molecular marker data from diploid organisms. Here we derive a theoretical framework for learning genetic population structure of a haploid organism from bi-allelic markers for which p...

Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning

Domain adaptation is a powerful technique when a large amount of labeled data with similar attributes is available in different domains. In real-world applications, there is a huge amount of data, but most of it is unlabeled. Domain adaptation is effective in image classification, where it is expensive and time-consuming to obtain adequately labeled data. We propose a novel method named DALRRL, which consists of deep ...

Evaluating the Effectiveness of Supervised and Unsupervised Classification Methods in Monitoring Regs (Case Study: Jazmourian Reg)

Because of their mobility and their direct impact on residential areas and various development activities, ergs are of major importance in desert areas, so monitoring them is very important. Considering that supervised and unsupervised methods are among the most common methods for determining and monitoring land uses, in this research, the accuracy...

Bayesian unsupervised classification framework based on stochastic partitions of data and a parallel search strategy

Advantages of statistical model-based unsupervised classification over heuristic alternatives have been widely demonstrated in the scientific literature. However, the existing model-based approaches are often both conceptually and numerically unstable for large and complex data sets. Here we consider a Bayesian model-based method for unsupervised classification of discrete valued vectors, that h...

Stochastic Unsupervised Learning on Unlabeled Data

In this paper, we introduce a stochastic unsupervised learning method that was used in the 2011 Unsupervised and Transfer Learning (UTL) challenge. This method is developed to preprocess the data that will be used in the subsequent classification problems. Specifically, it performs K-means clustering on principal components instead of raw data to remove the impact of noisy/irrelevant/less-relev...

Publication date: 1992